
[Bugfix]: Fix cross attention backend selection for Turing GPU#31806

Merged
Isotr0py merged 1 commit into vllm-project:main from Isotr0py:fix-whisper-attn-select
Jan 6, 2026

Conversation

@Isotr0py
Member

@Isotr0py Isotr0py commented Jan 6, 2026

Purpose

  • The default attention backend for Turing is FlashInfer, so cross-attention initialization fails because `attn_type=AttentionType.ENCODER_DECODER` is not passed when calling `get_attn_backend`:
(EngineCore_DP0 pid=185092) ERROR 01-06 12:32:35 [core.py:888]   File "/kaggle/working/vllm/vllm/model_executor/models/whisper.py", line 577, in <lambda>
(EngineCore_DP0 pid=185092) ERROR 01-06 12:32:35 [core.py:888]     lambda prefix: WhisperDecoderLayer(
(EngineCore_DP0 pid=185092) ERROR 01-06 12:32:35 [core.py:888]                    ^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=185092) ERROR 01-06 12:32:35 [core.py:888]   File "/kaggle/working/vllm/vllm/model_executor/models/whisper.py", line 414, in __init__
(EngineCore_DP0 pid=185092) ERROR 01-06 12:32:35 [core.py:888]     self.encoder_attn = WhisperCrossAttention(
(EngineCore_DP0 pid=185092) ERROR 01-06 12:32:35 [core.py:888]                         ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=185092) ERROR 01-06 12:32:35 [core.py:888]   File "/kaggle/working/vllm/vllm/model_executor/models/whisper.py", line 261, in __init__
(EngineCore_DP0 pid=185092) ERROR 01-06 12:32:35 [core.py:888]     super().__init__(
(EngineCore_DP0 pid=185092) ERROR 01-06 12:32:35 [core.py:888]   File "/kaggle/working/vllm/vllm/model_executor/models/whisper.py", line 190, in __init__
(EngineCore_DP0 pid=185092) ERROR 01-06 12:32:35 [core.py:888]     self.attn = CrossAttention(
(EngineCore_DP0 pid=185092) ERROR 01-06 12:32:35 [core.py:888]                 ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=185092) ERROR 01-06 12:32:35 [core.py:888]   File "/kaggle/working/vllm/vllm/attention/layers/cross_attention.py", line 162, in __init__
(EngineCore_DP0 pid=185092) ERROR 01-06 12:32:35 [core.py:888]     super().__init__(
(EngineCore_DP0 pid=185092) ERROR 01-06 12:32:35 [core.py:888]   File "/kaggle/working/vllm/vllm/attention/layer.py", line 266, in __init__
(EngineCore_DP0 pid=185092) ERROR 01-06 12:32:35 [core.py:888]     self.impl = impl_cls(
(EngineCore_DP0 pid=185092) ERROR 01-06 12:32:35 [core.py:888]                 ^^^^^^^^^
(EngineCore_DP0 pid=185092) ERROR 01-06 12:32:35 [core.py:888]   File "/kaggle/working/vllm/vllm/v1/attention/backends/flashinfer.py", line 1152, in __init__
(EngineCore_DP0 pid=185092) ERROR 01-06 12:32:35 [core.py:888]     raise NotImplementedError(
(EngineCore_DP0 pid=185092) ERROR 01-06 12:32:35 [core.py:888] NotImplementedError: Encoder self-attention and encoder/decoder cross-attention are not implemented for FlashInferImpl
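The failure mode can be illustrated with a minimal, self-contained sketch (not vLLM's actual code; `get_attn_backend_sketch` and the backend classes here are hypothetical stand-ins for the real selection logic). The point is that backend selection must see the attention type: without it, the default decoder path picks FlashInfer even for cross-attention, and the backend then raises `NotImplementedError` at construction time.

```python
# Hypothetical sketch of attention-backend selection; names are illustrative,
# not vLLM's real API.
from enum import Enum


class AttentionType(Enum):
    DECODER = "decoder"
    ENCODER_DECODER = "encoder_decoder"


class FlashInferBackend:
    """Stand-in for the Turing default: no encoder/decoder cross-attention."""

    def __init__(self, attn_type: AttentionType) -> None:
        if attn_type is AttentionType.ENCODER_DECODER:
            raise NotImplementedError(
                "Encoder/decoder cross-attention is not implemented "
                "for this backend"
            )
        self.attn_type = attn_type


class CrossAttnCapableBackend:
    """Stand-in for a backend that does support cross-attention."""

    def __init__(self, attn_type: AttentionType) -> None:
        self.attn_type = attn_type


def get_attn_backend_sketch(attn_type: AttentionType = AttentionType.DECODER):
    # Before the fix, CrossAttention called the selector without attn_type,
    # so the DECODER default chose the FlashInfer-style backend even for
    # encoder/decoder cross-attention.
    if attn_type is AttentionType.ENCODER_DECODER:
        return CrossAttnCapableBackend
    return FlashInferBackend


# The fix pattern: pass attn_type explicitly when building cross-attention,
# so a backend that supports ENCODER_DECODER is selected.
backend_cls = get_attn_backend_sketch(attn_type=AttentionType.ENCODER_DECODER)
impl = backend_cls(AttentionType.ENCODER_DECODER)
print(type(impl).__name__)  # CrossAttnCapableBackend
```

With the default (no `attn_type`), constructing the selected backend for cross-attention reproduces the `NotImplementedError` seen in the traceback above.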

Test Plan

python examples/offline_inference/audio_language.py -m whisper

Test Result

Test should pass on T4 now.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@Isotr0py Isotr0py requested a review from NickLucche January 6, 2026 12:45
@Isotr0py Isotr0py added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 6, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses a bug where an incorrect attention backend was selected for cross-attention on Turing GPUs, causing a NotImplementedError. The fix involves explicitly passing attn_type=AttentionType.ENCODER_DECODER when retrieving the attention backend in CrossAttention. This ensures that a backend supporting encoder-decoder attention is selected, resolving the issue. The change is correct and effectively fixes the bug. I have no further suggestions.

@Isotr0py Isotr0py merged commit 02809af into vllm-project:main Jan 6, 2026
48 of 49 checks passed
@Isotr0py Isotr0py deleted the fix-whisper-attn-select branch January 6, 2026 15:16
LucasWilkinson pushed a commit to neuralmagic/vllm that referenced this pull request Jan 6, 2026
yugong333 pushed a commit to yugong333/vllm that referenced this pull request Jan 9, 2026
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Jan 16, 2026
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
…project#31806)

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026

Labels

ready ONLY add when PR is ready to merge/full CI is needed
